Transmission of Genetic Information

Introduction

Motivation and Goals

Many topics within the study of genetics have a strong link to numerical or quantitative analysis. Long before the identification of DNA as the primary carrier of genetic information, both scientists and non-scientists were conducting breeding experiments. Initially these “experiments” had the goal of domesticating animals for human benefits (e.g., cattle, pigs, goats, dogs). Although not true experiments as we know them today, the end result of those efforts played a large part in the rise of agriculture and civilizations.

At least ten thousand years passed before the investigations of predictable breeding phenotypes in pea plants by Gregor Mendel (1822-1884), which ultimately led to modern genetics. The study of genetics during the first half of the 20th century was carried out without an understanding of the structure of DNA. Even without this specific knowledge, scientists were able to make great strides in understanding the patterns of evolution in populations. Much of this research was aided by simultaneous advances in statistics, many of which were developed to help make sense of the vast array of data being collected.

We have developed these exercises to give you an introduction to some of the statistical approaches that are used in the study of genetics. We hope that you will use them to further develop your understanding of the genetics topics you are covering in lecture and that you will begin to gain some intuition about the methods used to test hypotheses. Some of you might even wish to go deeper into understanding the math and computer code behind these exercises.

Because quantitative aspects of genetics necessarily involve numbers and equations, we will need to use them. By working through examples, we hope to demystify the numbers. Complex math is not usually necessary, and most calculations can be done with only addition, subtraction, multiplication, and division.

Finally, we will use computer code to carry out analyses and generate plots. These exercises are not designed to be a course in coding, and you do not need to learn any additional material to use the exercises. You will be able to work through by just clicking the “Run Code” button, entering or changing a few values, and clicking “Run Code” again. As mentioned above, if you are interested in the computational side, we encourage you to edit, explore, and learn all you can.1 But that is not necessary to use these exercises for your benefit.

How to use these exercises

This web page (https://biosc-2200.github.io/TGI/) is refreshed each time you visit. It is available any time, both on and off campus. It should look the same on laptops, tablets, and phones, although the text will be necessarily small on your phone.

The first and most important directive is to read carefully. There is no need to rush. The exercises are designed to be worked linearly, and each piece builds on those that come before it. Work through deliberately. When you get to a code chunk (see below), pay attention to any edits or data that you need to input.

There are prompts throughout that ask you to think, make a prediction, etc. Unfortunately, there is no way to save your work on this page, so it might be useful to save your responses to a separate document or to save/print this page when you are done.

These prompts look like this:

Take a few moments to think about what you hope to gain from these exercises.

R and webR

These exercises use the statistical programming language R for analysis and plotting. You don’t need to already know any coding in R or any other language. Everything you need to know will be provided.

On this web page, R is running from within your web browser using the webR framework, which means that you do not need to install any software (nor is any software permanently installed on your device).

When you first load the web page, it will take a few seconds for the software to load. When that is complete, you will see a status icon at the top of the page. Once you see the green “Ready!” sign, everything is set.

Code blocks

As you work through, you will encounter code blocks. Some code blocks will run when the page is first loaded (no interaction required on your part). Other code blocks won’t execute until you ask.

These code blocks will look like this, with a “Run Code” button:

Clicking the “Run Code” button will cause R to run the code and print the result (4). The simplest example is to just use R as a fancy calculator. If you haven’t already, click “Run Code” above.
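The interactive block itself is not reproduced in this text. As an illustration only, the sketch below uses an expression chosen because it evaluates to 4 (the specific expression is an assumption, not necessarily the one on the page) and adds two more calculator-style lines.

```r
# R as a calculator: each expression is evaluated and its value is printed.
result <- 2 + 2   # addition; stored so it can be reused
result            # prints 4

10 / 4            # division
3 ^ 2             # exponentiation
```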

Feel free to try different values and different operators (+, -, *, /, ^). You can’t break anything. The worst that will happen is that you will get a syntax error. Execute the code below to see what a syntax error looks like.

You can edit the block to make the code execute successfully, or just leave it and move on.

Learning Objectives

The learning objectives for this exercise are:

  • Calculate the probability of a particular gamete being produced from an individual, assuming independent assortment.
  • Calculate the probability of a particular genotype in the offspring, given independent assortment and random fertilization between two individuals.
  • Design genetic crosses to provide information about genes, alleles, and gene functions.
  • Use a Chi-squared test to determine how well data from a genetic cross fits theoretical predictions.
  • Explain how sample size is related to the certainty of estimates (CIs, Chi-squared test).

Review of Inheritance

When gametes are produced, each gamete receives a single copy of each gene. Because each of the F1 parents has two copies, there are four possible alleles that can be passed on (for example, two copies of D and two copies of d in Figure 1). A Punnett square can be used to enumerate the possible genotypic combinations that result from the proposed cross. By combining the alleles from each parent, the possible offspring genotypes can be determined. These genotypes can be converted into phenotypes if the nature of dominance of the alleles is known (Figure 1).

Figure 1: The process of gamete formation leads to predictable genotypic and phenotypic proportions in the F2 generation. Image modified from Klug et al. (2019)

The 1:2:1 genotypic ratio and 3:1 phenotypic ratio in the heterozygote cross example above represent theoretical probabilities for the distributions of genotypes and phenotypes. These ratios can be tested statistically to determine if a sample deviates significantly from the expected proportions.

Proportions and Probabilities

When thinking about genotypes and phenotypes, it is useful to be able to move between ratios, proportions, percentages, and probabilities. The 1:2:1 genotypic ratio can equally be thought of as

  1. Proportions: 1/4 DD, 2/4 Dd, and 1/4 dd
  2. Percentages: 25% DD, 50% Dd, and 25% dd
  3. Probabilities: 0.25 DD, 0.5 Dd, and 0.25 dd

Each set of proportions or probabilities sums to one (the percentages sum to 100%) and represents the full range of possible combinations.

Probabilities are very predictable in the long run but are often unpredictable in the short run. Consider only one gamete being produced, say a sperm. Each sperm that is produced from a heterozygote Dd can carry either the D or d allele.

We can simulate the production of a single sperm with the following code. First we create an object that holds the possible alleles. The code sample(allele, size = 1) randomly samples 1 of the possible alleles. Run this code a few times to see the first few gametes produced.
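The interactive block is not reproduced in this text. A minimal sketch of what the paragraph above describes might look like the following. The object name `allele` and the call `sample(allele, size = 1)` come from the text, while the name `gamete` is an assumption added for illustration.

```r
# Possible alleles carried by a sperm from a Dd heterozygote
allele <- c("D", "d")

# Draw one allele at random; by default each element is equally likely
gamete <- sample(allele, size = 1)
gamete
```

Because the draw is random, rerunning the last two lines can give a different allele each time.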

You probably noticed that, quite often, runs of several D or several d gametes in a row were produced.

The code block below repeats this sampling process 10 times and counts up the numbers of D and d gametes produced. replicate() generates the samples and table() counts.
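As a hedged sketch of the process just described (the original block is not shown in this text), the code below combines `replicate()` and `table()`. The variable names `gametes` and `counts` are assumptions.

```r
# Possible alleles from a Dd heterozygote
allele <- c("D", "d")

# Repeat the single-gamete draw n = 10 times
gametes <- replicate(n = 10, expr = sample(allele, size = 1))

# Count how many D and d gametes were produced
counts <- table(gametes)
counts
```

Changing `n = 10` to a larger value produces a larger sample of gametes.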

How many D and d gametes do you predict? How confident are you in your answer?

Run the code a few times to generate new samples of 10 gametes. Revisit your prediction.

Now gradually increase the number of gametes produced by changing n = 10 to n = 50, n = 100, etc.

What happens to the relative counts of D and d as n increases? What does this tell you about our ability to predict the exact counts either when sample sizes are small or when sample sizes are large? In general, how often do you see exactly 50% D and 50% d?

Blood types in humans

To explore the concepts of independent assortment and probabilities, we will use the blood-typing system that is used to classify human blood types based on the expression of different antigens on the surface of red blood cells (Figure 2). Blood types are important for transfusion medicine, where the mixing of incompatible blood types can have fatal results.

Figure 2: Three of the components of blood are white blood cells (involved in the immune response), platelets (involved in blood clotting), and red blood cells, which primarily carry oxygen. Image from learn.genetics

Blood typing in humans primarily uses the ABO system, although almost 30 other blood typing systems exist, which focus on other aspects of red blood cell structure and physiology.

A and/or B refer to the presence of A and/or B antigens on the surface of the red blood cells. O denotes the absence of both A and B antigens. The American Red Cross has a website with animations about which blood types are compatible with one another. In general, A must be matched with A and B with B, or a cross-reaction will take place. Because type O blood lacks both A and B antigens, it is considered a “universal donor”. And because type AB blood has both A and B antigens, it is considered a “universal recipient”.

Figure 3 shows the Punnett square for human blood types. Each parent can provide alleles for A, B, or no antigens (O), leading to a complex 1:2:1:1:1:2:1 ratio.

Figure 3: The Punnett square for human ABO alleles and associated blood types. Image from learn.genetics

The right side of Figure 3 shows the blood types (i.e., phenotypes) associated with each of the possible genotypes. Although there are 7 possible genotypes, there are only 4 possible blood types: A, B, AB, and O.

You may have wondered why there is no type AO or BO blood and how type AB blood is produced.

  1. Because the O allele has no associated antigen, an individual with A+O alleles will have type A blood and an individual with B+O will have type B blood (the same is true for O+A and O+B). So we can say that A and B are dominant with respect to O.
  2. Because the A and B alleles produce A and B antigens, respectively, an individual with A+B or B+A will have type AB blood. The A and B alleles are thus codominant.
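The two rules above can be sketched in code. The example below enumerates all nine ordered allele combinations of a 3 x 3 Punnett square and converts each one to a blood type. The function name `blood_type` and the other object names are hypothetical, chosen only for this illustration.

```r
# The three ABO alleles that each parent could contribute
alleles <- c("A", "B", "O")

# All 9 ordered combinations (rows: one parent; columns: the other parent)
punnett <- outer(alleles, alleles, paste0)
dimnames(punnett) <- list(alleles, alleles)

# Convert a two-letter genotype to a blood type:
# A and B are codominant, and both are dominant to O
blood_type <- function(genotype) {
  has_A <- grepl("A", genotype)
  has_B <- grepl("B", genotype)
  ifelse(has_A & has_B, "AB",
         ifelse(has_A, "A",
                ifelse(has_B, "B", "O")))
}

# Same layout as the Punnett square, but holding phenotypes
phenotypes <- matrix(blood_type(punnett), nrow = 3,
                     dimnames = dimnames(punnett))
phenotypes

# Number of Punnett square cells that give each blood type
table(phenotypes)
```

Note that the cell counts describe the Punnett square only. They are not the frequencies of blood types in a real population, where the three alleles are not equally common.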

Rh Factor

Another feature of red blood cells that is important for transfusion medicine is the Rh factor. The Rh system is more complicated than the ABO system, with approximately 50 different antigens involved (Dean 2005). Because a few proteins are most commonly involved, this system can be summarized as Rh+ and Rh- (Rh-positive and Rh-negative).

ABO type and Rh factor are determined by separate genes, meaning that Rh factor is inherited separately from ABO type. Thus the rules of Independent Assortment apply. Each ABO blood type can be associated with Rh+ or Rh-.
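To make independent assortment concrete, here is a minimal sketch for a hypothetical individual who carries the A and O alleles and is also heterozygous for Rh. It assumes the simplified two-allele Rh+/Rh- summary described above. Because the two genes assort independently, the probability of each gamete is the product of the two single-gene probabilities.

```r
# Probability of passing on each allele (hypothetical AO, Rh+/Rh- individual)
abo <- c(A = 0.5, O = 0.5)
rh  <- c("Rh+" = 0.5, "Rh-" = 0.5)

# Multiplication rule: every ABO allele combined with every Rh allele
gametes <- outer(abo, rh)
gametes

# The four gamete probabilities cover all possibilities
sum(gametes)
```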

Probabilities of blood types in a population

The distributions of A, B, AB, and O blood types, as well as Rh factors, differ among populations. For a hypothetical population, the probabilities of all combinations of ABO type and Rh status are summarized in the following table.

ABO   Rh Factor   Probability
A     -           0.005
A     +           0.270
AB    -           0.001
AB    +           0.070
B     -           0.004
B     +           0.250
O     -           0.010
O     +           0.390

The relative proportions of the eight different blood types can be shown on a bar chart, where the height of the bar is proportional to the probability of an individual having a specific ABO/Rh combination:

Visualizing the table above in this way really emphasizes the relative rarity of Rh-negative blood types in this population.

Note that the sum of all the probabilities is 1:
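The original block is not reproduced in this text. One way to check the sum, assuming the table is stored in a data frame called `blood` (a hypothetical name), is sketched below.

```r
# Blood type probabilities from the table above
blood <- data.frame(
  ABO = c("A", "A", "AB", "AB", "B", "B", "O", "O"),
  Rh = c("-", "+", "-", "+", "-", "+", "-", "+"),
  Probability = c(0.005, 0.270, 0.001, 0.070, 0.004, 0.250, 0.010, 0.390)
)

# Add up all eight probabilities
total <- sum(blood$Probability)
total
```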

If a large number of people have their blood typed, the proportions of people with each blood type should be approximately equal to the overall proportions in the table and figure above.

Combining probabilities

Each individual has exactly one of the 8 possible combinations of ABO type and Rh factor, so the combinations are mutually exclusive. We can therefore use the addition rule to determine the probability that a randomly selected individual has a certain blood type: add the probabilities of every combination that qualifies. The first few are provided for you. See if you can figure out the rest (here we are just using R as a fancy calculator).
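As a sketch of the addition rule for the first two cases listed below, the code here sums the relevant rows of the table. The data frame name `blood` and the result names `p_O` and `p_AB` are assumptions for illustration. The remaining cases are left for you to work out.

```r
# Blood type probabilities from the table above
blood <- data.frame(
  ABO = c("A", "A", "AB", "AB", "B", "B", "O", "O"),
  Rh = c("-", "+", "-", "+", "-", "+", "-", "+"),
  Probability = c(0.005, 0.270, 0.001, 0.070, 0.004, 0.250, 0.010, 0.390)
)

# Type O blood with either Rh factor: P(O-) + P(O+)
p_O <- sum(blood$Probability[blood$ABO == "O"])
p_O

# Type AB blood with either Rh factor: P(AB-) + P(AB+)
p_AB <- sum(blood$Probability[blood$ABO == "AB"])
p_AB
```

The same result can be obtained by typing the values directly, for example `0.010 + 0.390` for type O.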

Type O blood

Type AB blood

Type A or type B blood with any Rh factor

Any ABO type with Rh+

Not Type AB blood

Sampling from a population

The probabilities in the table above represent what are known as “long-run” probabilities, meaning that if we sample more and more people (up to the entire population size), the proportions that we observe will approach the expected probabilities.

What happens if we draw a smaller sample? Questions like this become very important when thinking about blood donation programs. Not everyone donates blood, and the specific population from which blood is sampled can have a dramatic effect on the relative supply of different blood types in a community.

The function below mimics the process of blood donations from a random sample of people. Random samples (i.e., donors) are drawn from a distribution where the probability of each of the eight blood types occurring is the same as in the table above.
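The page's own function and plot are not reproduced in this text. The sketch below shows one way such sampling could be done with `sample()` and its `prob` argument, reporting counts in a table rather than a plot. All object names, and the sample size of 100, are assumptions.

```r
# Blood type probabilities from the table above
blood <- data.frame(
  ABO = c("A", "A", "AB", "AB", "B", "B", "O", "O"),
  Rh = c("-", "+", "-", "+", "-", "+", "-", "+"),
  Probability = c(0.005, 0.270, 0.001, 0.070, 0.004, 0.250, 0.010, 0.390)
)
blood$Type <- paste0(blood$ABO, blood$Rh)

n_donors <- 100  # hypothetical number of donors

# Draw donors at random, weighting each blood type by its probability
donors <- sample(blood$Type, size = n_donors, replace = TRUE,
                 prob = blood$Probability)

# Use a factor with all 8 levels so types with zero donors still appear
observed <- table(factor(donors, levels = blood$Type))

# Expected counts next to the counts observed in this sample
comparison <- data.frame(
  Type = blood$Type,
  Expected = n_donors * blood$Probability,
  Observed = as.vector(observed)
)
comparison
```

Rerunning the code draws a new set of donors, so the observed counts change while the expected counts stay the same.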

The gray bars are the expected counts, and the red points are the observed counts in this sample. Run the code a few times and examine the plot.

What do you observe about the distribution of counts in the observed sample (red points) compared to the predicted probabilities (gray bars)? What does this tell you about the need to specifically recruit donors with rare blood types?

Go back to the code block above. Try raising the number of people sampled and see how the distribution changes.

What looks different when 1000 or 10000 people are sampled?

What is a confidence interval?

Confidence intervals for proportions

Effect of sample size on CIs

Confidence intervals for blood types

Statistical test for proportions

Goodness of fit and the chi-squared test

The chi-squared test2

Statistical test in reverse – significant test is a lack of fit

\[\chi^2 = \sum_{i=1}^n \frac{(Observed_i - Expected_i)^2}{Expected_i}\]
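To see how the formula works, here is a minimal sketch that applies it to the 3:1 phenotypic ratio from the Review of Inheritance. The observed counts (80 dominant and 20 recessive offspring) are invented for illustration and are not data from these exercises. The hand calculation is then compared with R's built-in `chisq.test()`.

```r
# Hypothetical F2 phenotype counts (made up for this example)
observed <- c(dominant = 80, recessive = 20)

# Expected proportions under a 3:1 phenotypic ratio
expected_p <- c(0.75, 0.25)
expected <- sum(observed) * expected_p   # 75 and 25

# Chi-squared statistic: sum of (Observed - Expected)^2 / Expected
chi_sq <- sum((observed - expected)^2 / expected)
chi_sq

# Degrees of freedom: number of categories minus 1
deg_free <- length(observed) - 1

# P-value: probability of a statistic at least this large if the 3:1 ratio holds
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)
p_value

# The built-in test should give the same statistic and p-value
builtin <- chisq.test(observed, p = expected_p)
builtin
```

A small p-value indicates a lack of fit between the observed counts and the expected proportions. A large p-value means the data are consistent with the expected ratio.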

Using a Chi-squared test

Days of birth

Day of the week for 350 births

What is the predicted number of births per day?

What is your prediction?

Case study: Detecting a recessive muscle mutation in mice

References

Dean, L. 2005. Blood Groups and Red Cell Antigens. National Center for Biotechnology Information (US).
Klug, W. S., M. R. Cummings, C. A. Spencer, M. A. Palladino, and D. Killian. 2019. Concepts of Genetics. 12th ed. Pearson.

Footnotes

  1. We are happy to provide additional resources. Just ask.↩︎

  2. Also called the chi-square test or the \(\chi^2\) test↩︎